Intelligent Data Understanding Group


The IDU group develops novel algorithms to
detect, classify, and predict events in large
data streams for scientific and engineering
systems.
• In early January 2007, the ISS Early External Thermal
Control System developed an ammonia gas bubble.

• The bubble was noted by ISS controllers only ~9 hours before it
"burst" and dissipated back into liquid.


Key areas of research in data mining


Research Topic Areas 

• Anomaly Detection 

• Prediction Systems 

• Text Mining 

• Mining Distributed Data Systems 
and Sensor Networks 

• High Performance Time Series 
Search 


Application Areas 

• Safety critical systems 

• Large scale distributed systems 

• Earth Sciences 

• Space Sciences 

• Systems Health Data from 
Aeronautical and Space Systems 


Chapman & Hall/CRC 

Data Mining and Knowledge Discovery Series 





Text Mining 

Classification, 
Clustering, and 
Applications 


Ashok N. Srivastava 
Mehran Sahami 





NASA Data Systems 


• Earth and Space Science 

- Earth Observing System generates ~21 TB of 
data per week. 

- Ames simulations generating 1-5 TB per day 

• Aeronautical Systems 

- Distributed archive growing at 100K flights per 
month with 2M flights already. 

• Exploration Systems 

- Space Shuttle and International Space station 
downlinks about 1.5GB per day. 


Developing Virtual Sensors 


• Virtual Sensors predict the value of one 
sensor measurement by exploiting the 
nonlinear correlations between its values 
and other sensor readings. 

• Useful for emulating sensors back in time 
or estimating the value of one sensor 
based on other sensor measurements 


Z: Sensor measurements
λ: Wavelength or Frequency
u: Position

$Z(u, \lambda, t) = [Z_u(\lambda, t)] = [Z_{u_1}(\lambda, t), Z_{u_2}(\lambda, t), \ldots, Z_{u_n}(\lambda, t)]^T$


Mm = 


T{Z{B))Z{B)dB 


Predicted Sensor 
Measurement 
Uncertainty


Earth and Space Sciences 



Aeronautics and Space Systems 





Virtual Sensors in the 
Earth Sciences 


Collaborators 

Ashok N. Srivastava, NASA Ames 
Nikunj C. Oza, NASA Ames 

Julienne Stroeve, National Snow and Ice Data Center 
Ramakrishna Nemani, NASA Ames 
Petr Votava, NASA Ames 


Has Cloud Cover Changed over 
Greenland in the past 30 years? 



• New sensors on the MODIS system can detect clouds over snow and ice in the
1.6 µm band (circa 1999).

• Difficult over snow and ice-covered surfaces because of low contrast in visible 
and thermal infrared wavelengths. 

• Older sensors from the AVHRR system do not detect cloud cover over snow 



Joint work with Nikunj Oza, Julienne Stroeve, Rama Nemani, Brett Zane-Ulman


Cloud Detection back in Time 



• The MODIS 1.6 µm channel has enough contrast for this task.

• However, the 1.6 µm channel is not available in AVHRR/2.

• Predict the 1.6 µm channel using a Virtual Sensor

MODIS 



Model Application 


Accuracy Results for Three Models

• True Positive = number of times channel 6 indicated a
cloud and the model predicted cloud

• True Negative = number of times channel 6 indicated no
cloud and the model predicted no cloud
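
The definitions above can be sketched as a small counting routine. This is only an illustration under stated assumptions: the function name `cloud_mask_scores`, the boolean encoding (True = cloud), and the derived rates are hypothetical and do not come from the slides.

```python
import numpy as np

def cloud_mask_scores(observed, predicted):
    """Count agreement between a reference cloud mask and a model's cloud mask.

    observed  : cloud / no-cloud flags taken from the real channel 6 (assumed boolean)
    predicted : cloud / no-cloud flags produced by the model (assumed boolean)
    """
    observed = np.asarray(observed, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    tp = int(np.sum(observed & predicted))     # channel 6: cloud,    model: cloud
    tn = int(np.sum(~observed & ~predicted))   # channel 6: no cloud, model: no cloud
    fp = int(np.sum(~observed & predicted))    # channel 6: no cloud, model: cloud
    fn = int(np.sum(observed & ~predicted))    # channel 6: cloud,    model: no cloud
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        # Rates guard against division by zero when a class is absent.
        "tp_rate": tp / max(tp + fn, 1),
        "tn_rate": tn / max(tn + fp, 1),
    }
```

For example, `cloud_mask_scores([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])` counts two true positives and one true negative.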





Verification of Models on MODIS Data 





SVM RBF kernel


SVM MDMK kernel 






Application of Models to AVHRR Data


Summary

• Application to entire historical record is a significant task 
because of data quality issues and transitions from one sensor 
system to another. 


• Method applied to emulation of physics models to calculate 
corrections for surface albedo measurements resulted in an 
increase in speed by factor of 27 compared to existing methods. 


• Potential to deploy Virtual Sensors for generation of a historical 
cloud mask record. 


• Model verification and validation must be done by hand since 
we have no signal for comparison. 



A. N. Srivastava, N. C. Oza, and J. Stroeve, "Virtual Sensors: Using Data Mining
Techniques to Efficiently Estimate Remote Sensing Spectra," Special Issue on Advanced
Data Analysis, IEEE Transactions on Geoscience and Remote Sensing, March 2005.



Virtual Sensors in 
Astrophysics 


Collaborators 

Michael J. Way, NASA Goddard Institute for Space Studies
Leslie Foster, San Jose State University
Ashok N. Srivastava, NASA Ames
Paul Gazis, NASA Ames
Jeffrey Scargle, NASA Ames


Declination 


Estimating Photometric Redshifts 
in the Sloan Digital Sky Survey 
Joint work with Michael J. Way, Leslie Foster, Paul Gazis, and Jeffrey Scargle


Photometric Redshifts are Broadband
Measurements of Spectra 


NGC5102 and SDSS Filters 


Gaussian Process Regression


Can have high accuracy and also a measure of uncertainty.

Some low-rank matrix approximations work well but
can have numerical problems.


Training Data:

• X - data matrix of observations - n x d
• y - vector of target data - n x 1

Testing Data:

• X* - matrix of new observations - n* x d

Goal:

• predict y* corresponding to X*


• Form covariance matrix K (n x n),
cross covariance matrix K* (n* x n) and
select parameter λ

• predict y* using

$y^* = K^* (\lambda^2 I + K)^{-1} y$

• the n x n matrix $(\lambda^2 I + K)$ is large for large
data sets


• Memory: Storing covariance matrix - O(n^2)
• Time: Solving linear system - O(n^3)

• Numerical stability: accurate calculations.
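
The prediction formula above can be sketched directly in NumPy. This is a minimal illustration, not the authors' code: the squared-exponential kernel, the `length_scale` parameter, and the function names are assumptions made for the example; only the formula y* = K*(λ²I + K)⁻¹y comes from the slide.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance between the rows of A and the rows of B (assumed kernel)."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_predict(X, y, X_star, lam=0.1, length_scale=1.0):
    """Full GP mean prediction y* = K* (lam^2 I + K)^{-1} y.

    Stores the n x n matrix K (O(n^2) memory) and factors it (O(n^3) time).
    """
    K = rbf_kernel(X, X, length_scale)            # n x n covariance matrix
    K_star = rbf_kernel(X_star, X, length_scale)  # n* x n cross covariance
    # Cholesky factorization of the symmetric positive definite matrix lam^2 I + K.
    L = np.linalg.cholesky(K + lam**2 * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return K_star @ alpha
```

The cubic cost of the factorization is what makes this direct form impractical for very large n, which motivates the low-rank approximations discussed next.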


Standard Least Squares Problem


o Given:

n x m matrix A, n > m
n x 1 vector y

o Solve $\min_x \|y - Ax\|$

o Normal Equations: $x = (A^T A)^{-1} A^T y$
potential numerical instabilities

o QR: $A = QR$, $x = R^{-1} Q^T y$
stable calculation
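
The two solution routes above can be compared side by side. The following is a small sketch with hypothetical function names; it uses the general solver `np.linalg.solve` on the triangular factor R for brevity, where a dedicated triangular solve would normally be preferred.

```python
import numpy as np

def lstsq_normal(A, y):
    """Least squares via the normal equations x = (A^T A)^{-1} A^T y.

    Forming A^T A squares the condition number of A, which is the
    source of the potential numerical instability.
    """
    return np.linalg.solve(A.T @ A, A.T @ y)

def lstsq_qr(A, y):
    """Least squares via the reduced QR factorization A = QR, x = R^{-1} Q^T y."""
    Q, R = np.linalg.qr(A)   # Q is n x m with orthonormal columns, R is m x m upper triangular
    return np.linalg.solve(R, Q.T @ y)
```

On a well-conditioned A both routines agree to rounding error; the difference shows up when the columns of A are nearly linearly dependent.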


Computational Challenges 

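Subset of Regressors [Wahba, 1990]
$y^* \cong K_1^* (\lambda^2 K_{11} + K_1^T K_1)^{-1} K_1^T y$
(K_1: n x m matrix of m selected columns of K; K_11: the corresponding m x m block of K; K_1*: the same columns of K*)
Memory: Storing covariance matrix - O(nm)

Time: Solving linear system - O(nm^2)

Numerical stability: ???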
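
A direct transcription of the Subset of Regressors formula is sketched below. The function name and argument layout are assumptions for illustration; the caller is assumed to have already chosen the m columns and to pass in the corresponding blocks.

```python
import numpy as np

def sr_predict(K1, K11, K1_star, y, lam):
    """Subset of Regressors prediction.

    K1      : n x m, the m selected columns of the covariance matrix K
    K11     : m x m, the rows of K1 that belong to the selected points
    K1_star : n* x m, the same columns of the cross covariance K*
    y       : length-n target vector
    lam     : the parameter lambda from the slides

    Only n x m and m x m matrices are formed: O(nm) memory, O(nm^2) time.
    Note that K1.T @ K1 is a normal-equations style product, which is where
    the question marks about numerical stability come from.
    """
    M = lam**2 * K11 + K1.T @ K1
    return K1_star @ np.linalg.solve(M, K1.T @ y)
```

When all n columns are selected and K is invertible, the expression reduces algebraically to the full formula K*(λ²I + K)⁻¹y, which gives a convenient sanity check.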


Cures for Numerical Instability: The V-Method


Approach

1. Select columns to make K_1
well conditioned

2. Use stable technique
for least squares
problem such as

• QR factorization

• V method

3. Requirement: maintain
O(nm) memory use and
O(nm^2) efficiency.


Column Selection

Use Cholesky factorization
with pivoting to partially
factor K

selects appropriate
columns for K_1

K_1 will be well conditioned
if cond(K_1) is O(condition
of optimal low rank
approximation).
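
The column selection step can be illustrated with a textbook partial Cholesky factorization with diagonal pivoting. This is a generic sketch, not the implementation used by the authors; the function name, the tolerance `tol`, and the choice to index into a fully formed K (a memory-conscious version would generate columns on demand) are all assumptions.

```python
import numpy as np

def pivoted_cholesky(K, m, tol=1e-12):
    """Partial Cholesky factorization of a symmetric PSD matrix with diagonal pivoting.

    Returns V (n x r, r <= m) and the list of selected column indices, so that
    K[:, pivots] equals V @ V[pivots].T up to rounding error and V @ V.T is a
    rank-r approximation of K.
    """
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    resid_diag = np.diag(K).copy()   # diagonal of the residual K - V V^T
    V = np.zeros((n, m))
    pivots = []
    for j in range(m):
        p = int(np.argmax(resid_diag))        # column with the largest residual
        if resid_diag[p] <= tol:              # remaining residual is negligible
            V = V[:, :j]
            break
        pivots.append(p)
        col = K[:, p] - V[:, :j] @ V[p, :j]   # residual of column p
        V[:, j] = col / np.sqrt(resid_diag[p])
        resid_diag -= V[:, j] ** 2
        resid_diag[pivots] = -np.inf          # never select the same column twice
    return V, pivots
```

The rows of V at the pivot indices, taken in pivot order, form a lower triangular m x m block, which plays the role of the triangular factor in the V-Method.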


The V-Method is the innovation of Leslie Foster and his students at San Jose State University 


The V-Method 


o Factor $K_1 = V V_{11}^T$ where V is n x m and $V_{11}$
is m x m lower triangular

o $y^* = K_1^* V_{11}^{-T} (\lambda^2 I + V^T V)^{-1} V^T y$

o V is a rescaling of a well conditioned matrix
o method is numerically stable


o can be faster and need less memory

o related to [Peters and Wilkinson, 1970],
[Wahba, 1990]
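
The V-Method formula can be sketched as follows. This is an illustrative reading of the slide, not the published implementation: it assumes the m columns are already selected, obtains V_11 from a Cholesky factorization of K_11, and evaluates (λ²I + VᵀV)⁻¹Vᵀy through a QR factorization of a stacked matrix rather than by forming VᵀV. All names are hypothetical.

```python
import numpy as np

def v_method_predict(K1, K11, K1_star, y, lam):
    """Evaluate y* = K1* V11^{-T} (lam^2 I + V^T V)^{-1} V^T y.

    K1 is n x m, K11 is m x m (assumed positive definite), K1_star is n* x m.
    """
    V11 = np.linalg.cholesky(K11)         # lower triangular, K11 = V11 V11^T
    V = np.linalg.solve(V11, K1.T).T      # K1 = V V11^T  =>  V = K1 V11^{-T}
    m = V.shape[1]
    # x = (lam^2 I + V^T V)^{-1} V^T y is the solution of the least squares
    # problem min || [V; lam I] x - [y; 0] ||, solved here by QR.
    A = np.vstack([V, lam * np.eye(m)])
    b = np.concatenate([y, np.zeros(m)])
    Q, R = np.linalg.qr(A)
    x = np.linalg.solve(R, Q.T @ b)
    z = np.linalg.solve(V11.T, x)         # z = V11^{-T} x
    return K1_star @ z
```

Substituting K_1 = V V_11ᵀ and K_11 = V_11 V_11ᵀ into the Subset of Regressors formula gives the same expression, so on well-conditioned data the two should agree to rounding error.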
Our ensemble models
produce the best
redshift estimates
published to date.

We are developing
Gaussian Process
Regression methods to
scale to 10^6 galaxies
and beyond.






Scalability Results 


Data Set 1: u-g-r-i-z, RANK=200

Data Set 1: u-g-r-i-z, RANK=400

Data Set 1: u-g-r-i-z, RANK=800

Best Published Results so far*

Data Set 1: u-g-r-i-z, Sample=40000

Data Set 1: u-g-r-i-z, Sample=80000


* To the best of our knowledge 


Results for Redshift Predictions 


• The V-Formulation provides an extremely 
scalable and numerically stable method to 
compute Gaussian Process Regression for 
arbitrary kernels. 

• With low-rank matrix inversion approximations 
GPs performed better than all other methods. 

• Allows us to compute GPs for O(200K) points in 
a few seconds on a standard desktop PC. 



L. Foster, A. Waagen, N. Aijaz, M. Hurley, A. Luis, J. Rinsky, C. Satyavolu, M. J. Way, P.
Gazis, and A. N. Srivastava, "Stable and Efficient Gaussian Process Calculations," Journal
of Machine Learning Research, 10(Apr):857-882, 2009.
INTEGRATED VEHICLE
HEALTH MANAGEMENT


Data Mining 
Supporting the 

Flight Readiness Review for STS-119 


Collaborators 

Ashok N. Srivastava, NASA Ames 
Dave Iverson, NASA Ames 
Bryan Matthews, SGT 
Bill Lane, NASA Johnson Space Center 
Bob Beil, NASA Kennedy Space Center 


Overview 



• Ashok received a request to support, as the Data Mining Subject Matter Expert,
the Flight Readiness Review for STS-119, which was scheduled for 2/20/09.


• Data mining algorithms developed at NASA were applied to these data to
determine whether any anomalies could be detected in STS-126 and its predecessor
flight STS-123 for Space Shuttle Endeavour.





Algorithms and Data 



• IMS (Inductive Monitoring System): a data point 
is anomalous if it is far away from clusters of 
nominal points. 

• Orca: a data point is anomalous if it is far away 
from its nearest neighbors. 

• Virtual Sensor: a data point is anomalous if the 
actual value is far away from the predicted value. 

• Data: 13 pressure, temperature, and control 
variables related to the Flow Control Valve 
subsystem. 
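
Two of the anomaly definitions above can be illustrated with toy scoring functions. These are simplified stand-ins written for this example only; they are not the Orca or Virtual Sensor implementations, and the function names, the choice of k, and the use of a mean distance are assumptions.

```python
import numpy as np

def knn_anomaly_scores(nominal, test, k=3):
    """Distance-based score: mean distance from each test point to its k nearest
    nominal points. Larger scores mean the point is farther from its neighbors."""
    nominal = np.asarray(nominal, dtype=float)
    test = np.asarray(test, dtype=float)
    # Pairwise Euclidean distances, shape (n_test, n_nominal).
    d = np.linalg.norm(test[:, None, :] - nominal[None, :, :], axis=2)
    nearest = np.sort(d, axis=1)[:, :k]
    return nearest.mean(axis=1)

def residual_anomaly_scores(actual, predicted):
    """Prediction-based score: absolute gap between the measured value and the
    value predicted from the other sensors."""
    return np.abs(np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float))
```

In both cases a threshold on the score, chosen from nominal data, would turn the score into an anomaly flag.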


STS-123 FCV Pressures IMS Analysis (axis: IMS Anomaly Score)


STS-126 FCV Pressures IMS Analysis (axis: IMS Anomaly Score)
Virtual Sensor: STS-118 and STS-126


A. N. Srivastava, B. Matthews, D. Iverson, B. Beil, and B. Lane, "Multidimensional
Anomaly Detection on the Space Shuttle Main Propulsion System: A Case Study,"
submitted to IEEE Transactions on Systems, Man, and Cybernetics, Part C, 2009.


The Role of Data Mining in 
Aviation Safety 

Ashok N. Srivastava, Principal Investigator 
Claudia Meyer, Project Manager 
Robert Mah, Project Scientist 


Integrated Vehicle Health Management: 
An Aviation Safety Project 


Goal: Validated multidisciplinary integrated vehicle health management tools
and techniques to enable automated detection, diagnosis, prognosis and mitigation of
adverse events during flight.


Level 4 - Aircraft Level

IVHM 4.1 Ground/Flight Demo
IVHM 4.2 Systems Analysis
IVHM 4.3 DASHlink
IVHM 4.4 Research Test and Integration


IVHM 3.1


Level 2 - Subsystems

IVHM 2.1 Aircraft
IVHM 2.2


Level 1 - Foundational

IVHM 1.1 Advanced Sensors and Materials
IVHM 1.2 Modeling
IVHM 1.3 Advanced Analytics and Complex Systems
IVHM 1.4 Verification and Validation


IVHM Covers a broad range of technologies
Data Mining in Support of Global Operations

Figure labels: Air Navigation Operations and Support; Flight Operations and Support; Weather Assimilated into Decision-Making; Broad-Area Precision Navigation; Global Harmonization; Network-Enabled Information Access; Layered Adaptive Security; Safety. Scale axis: 10^-5 to 10^5.


DASHlink.arc.nasa.gov

DASHlink harnesses the power of web 2.0 to further Systems Health
and Data Mining research


Organization of IVHM


•Project Operations 

Manager: Jeff Rybak 


•NRA 

Manager: Lilly Spirkovska 


Principal Investigator: Ashok Srivastava 
Project Scientist: Robert Mah 
Project Manager: Claudia Meyer 


ARC, DFRC, GRC, LaRC Center 
POCs 


•Level 4 


•Multidisciplinary Ground/ 
Flight Demos 

•Leads: PI, PS, PM 



•Systems Analysis for 
Health Management 

Lead: Mary Reveley 





ARC APM: Steve Jacklin 
DFRC POC: Mark Dickerson 
GRC APM: Bob Kerczewski 
LaRC APM: Sharon Graves 




•Research Test and 
Integration 

Lead: Robert Mah 



•DASHlink

Lead: Elizabeth Foughty 



•Level 3 

Associate Principal Investigators 


•Detection - API: John Lekki

•Diagnosis - API: Rick Ross

•Prognosis - API: Kai Goebel

•Mitigation - API: Eric Cooper

•Integrity Assurance - API: Eric Cooper


•Level 2 



Aircraft Systems 


Airframe 


Propulsion Systems 




Traditional Aircraft Subsystems - well represented in Levels 1,3 and 4 


Software 
Lead: Paul Miner 


Newly Recognized Aircraft Subsystem 


•Level 1 Lead Researchers


•Advanced Sensors and Materials


•Modeling


•Advanced Analytics and Complex Systems


•Verification and Validation


Lead: Steve Jacklin 

Lead: Tim Bencic 


Lead: Kevin Wheeler 


Lead: Nikunj Oza 





The Data Mining Team 



Group Members 

Kanishka Bhaduri, Ph.D. 
Santanu Das, Ph.D. 
Elizabeth Foughty 
Dave Iverson 
Rodney Martin, Ph.D. 
Bryan Matthews 
Nikunj Oza, Ph.D. 
Mark Schwabacher, Ph.D.
John Stutz 

David Wolpert, Ph.D. 


Funding Sources 

NASA Aeronautics Research Mission
Directorate - IVHM Project

NASA Engineering and Safety Center 

Exploration Systems Mission Directorate 
Exploration Technology Development 
Program, ISHM Project 

Science Mission Directorate 


Team Members are NASA Employees, Contractors, and Students. 





APPENDIX 


Virtual Sensors Approach 

• Given: MODIS channels 1, 2, 20, 31, and 32 correspond to the five AVHRR/2 channels

• Develop a model for MODIS channel 6 (1.6 µm) as a function of these channels

• Use the function to construct an estimate of the 1.6 µm channel for AVHRR/2
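
The three steps above can be sketched with a deliberately simple regression model. This is an assumption-laden illustration: the quadratic least-squares model and the function names are stand-ins chosen for brevity and are not the models used in the work described here.

```python
import numpy as np

def _features(inputs):
    """Constant, linear, and squared terms of the five input channels (assumed model form)."""
    inputs = np.asarray(inputs, dtype=float)
    return np.column_stack([np.ones(len(inputs)), inputs, inputs**2])

def fit_virtual_sensor(inputs, target):
    """Model construction: fit the target channel as a function of the shared channels.

    inputs : (n_pixels, 5) values of the channels available on both instruments
    target : (n_pixels,) values of the channel that exists only on the newer instrument
    """
    coef, *_ = np.linalg.lstsq(_features(inputs), np.asarray(target, dtype=float), rcond=None)
    return coef

def apply_virtual_sensor(coef, inputs):
    """Model application: estimate the missing channel from the shared channels."""
    return _features(inputs) @ coef
```

In this sketch the model would be fitted on MODIS pixels, where channel 6 is observed, and then applied to the five corresponding AVHRR/2 channels, where it is not.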



Model Construction 


AVHRR 1,2,3,4,5 



AVHRR 6 



Model Application 


Characterizing the Large Scale 
Structure of the Universe 



SDSS DR3 GREAT Spectra 



There are between 125 and 500 billion 
galaxies in the universe. 

Obtaining a good estimate of their 3-D 
position in the sky would help determine 
the filamentary structure of the universe 
to constrain cosmological models. 
We are building machine learning methods
to estimate the redshift of galaxies using 
broad-band photometry. 

If these estimates are of high enough 
accuracy, it would enable a better 
understanding of how the universe evolved after the Big Bang. 
What are Photometric Redshifts?


Photometric Redshifts: A rough estimate of the redshift of 
a galaxy without having to measure a spectrum. 
The Empirical Approach to
Redshift Estimation 

Training sample consists of galaxies with 

• known spectroscopic redshift 

• a comparable range of magnitudes (u g r i z) to our photometric 
survey objects 

Galaxy Photometric Redshift Prediction History 

• Linear Regression was first tried in the 1960s 

• Quadratic & Cubic Regression (1970s) 

• Polynomial Regression (1980s) 

• Neural Networks (1990s) 

• Kd Trees & Bayesian Classification Approaches (1990s) 

• Support Vector Machines & GP Regression (2000s) 



Kernels Incorporate Prior Knowledge 


Gaussian Process Regression 


A large # of hidden units in a Neural Network
→
Gaussian Process Regression (Neal 1996).

Figure: neural network with Inputs and Hidden Units.

Johann Carl Friedrich Gauss (1777-1855), painted by Christian Albrecht
Jensen (wikipedia)



Large Scale Gaussian Processes 



With our SDSS (DR3) Main Galaxy spectroscopic sample
(180,000 galaxies) the matrix size is 180,000 x 180,000

• Need a supercomputer with a LOT of RAM and CPU time?

• One can take a random sample of ~1000 galaxies & invert that
while bootstrapping n times from full sample

• However, some low-rank matrix approximations, such as
Cholesky Decomposition and Subset of Regressors, work well but can
have numerical problems.

• Solution: V-method (Cholesky decomposition with pivoting)
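
To make the size argument concrete, the storage needed for the full covariance matrix can be compared with that of a low-rank factor. The sketch assumes dense 8-byte floating point entries and uses rank 800 (one of the ranks shown in the scalability results) as an example; the function name is hypothetical.

```python
def dense_matrix_gigabytes(rows, cols, bytes_per_entry=8):
    """Memory for a dense rows x cols matrix, in gigabytes (1 GB = 1e9 bytes)."""
    return rows * cols * bytes_per_entry / 1e9

n = 180_000                                    # galaxies in the spectroscopic sample
full_gb = dense_matrix_gigabytes(n, n)         # full n x n covariance matrix
low_rank_gb = dense_matrix_gigabytes(n, 800)   # n x m factor with m = 800
```

Under these assumptions the full matrix needs about 259 GB, while the n x 800 factor needs a little over 1 GB.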




Numerical Instability in 
Subset of Regressors Method 


In SR formula consider special case λ = 0

$y^* = K_1^* (K_1^T K_1)^{-1} K_1^T y$


Exactly normal equations solution to the
least squares prediction problem:

$\min_x \|y - K_1 x\|$ and $y^* = K_1^* x$

Note: can be easily extended for λ ≠ 0


Potential numerical instability 


Low Rank Approximations 
Results from Other Authors

Method Name                 σ_rms    Dataset    Inputs          Source
CWW                         0.0666   SDSS-EDR   ugriz           Csabai et al. (2003)
Bruzual-Charlot             0.0552   SDSS-EDR   ugriz           Csabai et al. (2003)
ClassX                      0.0340   SDSS-DR2   ugriz           Suchkov et al. (2005)
Polynomial                  0.0318   SDSS-EDR   ugriz           Csabai et al. (2003)
Support Vector Machine      0.0270   SDSS-DR2   ugriz           Wadadekar (2005)
Kd-tree                     0.0254   SDSS-EDR   ugriz           Csabai et al. (2003)
Support Vector Machine      0.0230   SDSS-DR2   ugriz+r50+r90   Wadadekar (2005)
Artificial Neural Network   0.0229   SDSS-DR1   ugriz           Collister & Lahav (2004)


•Stanford 08 


Summary of Our Results 


Results: SDSS (DR3) Main Galaxy Sample 

Paper I: Compared linear, quadratic, Neural Networks
and GPs on the SDSS

With ONLY 1000 samples GPs performed well 
compared to the other methods 

Paper II: With low-rank matrix inversion 
approximations GPs performed better than all other 
methods 




Virtual Sensor: STS-123 and STS-126 


OV-1 05-1 301 -STS-123 
Summary of Research Needs in Aviation Safety



• Aircraft aging and durability 

- Full fundamental knowledge about legacy aircraft 

- Start on knowledge about likely emerging materials and structures 

• On-board system failures and faults - airframe, propulsion, aircraft systems (physical and 
software) 

- Early prediction, detection and diagnosis 

- Prognosis 

- Mitigation 

• Monitoring for problems before they become accidents 

- Vehicle issues 

- Airspace issues 

• Loss-of-control 

- Understanding aircraft dynamics of current and future vehicles in damaged and upset 
conditions 

- Control systems robust to the unanticipated and anticipated 

- Aircraft guidance for emergency operation 

• Flight in hazardous conditions 

- Modeling and sensing airframe and engine icing and icing conditions 

- Sensing and portraying environmental hazards 

• New operations 

- Design of robust collaborative work environments 

- Design of effective, robust human-automation systems 

- Information management and portrayal for effective decision making 



Integrated Vehicle 
Health 

Management 



The Powers of Aviation Safety: 10^-6 to 10^6



• There is no one 'silver bullet' - we must look 
at all contributors to safety 

• Consider the space we must consider: 

- Safety at the smallest level
- Safety spanning the nation (and the world!)

• Let us consider these different sizes, expressed 
as 'Powers of Ten' 





Recent Safety Advances 


U.S. and Canadian Operators Accident Rates by Year 


Fatal Accidents - Worldwide Commercial Jet Fleet - 1959 Through 2006 